interrupt processing during FATAL processing

    andres@anarazel.de · 2022-07-15T17:29:38+00:00
    Hi,

    One of the problems we found ([1]) when looking at spurious failures of the recovery conflict test ([2]) is that a single backend can FATAL out multiple times. That seems independent enough from that thread that I thought it best to raise separately.

    The problem is that during the course of FATAL processing, we end up processing interrupts, which then can process the next recovery conflict interrupt. This happens because while we send the FATAL to the client (errfinish()->EmitErrorReport()) we'll process interrupts in various places. If e.g. a new recovery conflict interrupt has been raised (which startup.c does at an absurd rate), that'll then trigger a new FATAL. Sources for recursive processing of interrupts are e.g. < ERROR ereports like the COMMERROR in internal_flush().

    A similar problem does *not* exist for ERROR, because errfinish() PG_RE_THROW()s before the EmitErrorReport() and PostgresMain() does HOLD_INTERRUPTS() first thing after sigsetjmp().

    One might reasonably think that the proc_exit_inprogress logic in die(), RecoveryConflictInterrupt() etc. protects us against this - but it doesn't, because we've not yet set it when doing EmitErrorReport(); it gets set later, during proc_exit().

    I'm not certain what the best way to address this is. One approach would be to put a HOLD_INTERRUPTS() before EmitErrorReport() when handling a FATAL error. I'm a bit worried this might make it harder to interrupt FATAL errors when they're blocking sending the message to the client - ProcessClientWriteInterrupt() won't do its thing. OTOH, it looks like we already have that problem if there's a FATAL after the sigsetjmp() block did HOLD_INTERRUPTS(), because errfinish() won't reset InterruptHoldoffCount like it does for ERROR.

    Another approach would be to set proc_exit_inprogress earlier, before the EmitErrorReport(). That's a bit ugly, because it ties ipc.c more closely to elog.c, but it also "feels" correct to me. OTOH, it's at best partial protection, because it doesn't prevent already-pending interrupts from being processed.

    I guess we could instead add a dedicated variable indicating whether we're currently processing a FATAL error? I was considering exposing a function checking whether elog's recursion_depth is != 0, and short-circuiting ProcessInterrupts() based on that, but I don't think that'd be good - we want to be able to interrupt some super-noisy NOTICE or such.

    I suspect that we should, even if it does not address this issue, reset InterruptHoldoffCount in errfinish() for FATALs, similar to ERRORs, so that FATALs can benefit from the logic in ProcessClientWriteInterrupt().

    Greetings,

    Andres Freund

    [1] https://postgr.es/m/20220701231833.vh76xkf3udani3np%40alap3.anarazel.de
    [2] clearly we needed that test urgently, but I can't deny regretting the consequence of having to fix the plethora of bugs it's uncovering
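    To make the re-entrancy concrete, here is a minimal, self-contained C sketch (not actual backend code - every symbol here is a hypothetical stand-in for the real elog.c/postgres.c machinery) of how reporting a FATAL can process a pending interrupt that raises a second FATAL, and how a dedicated "currently processing a FATAL" flag, set before the report goes out, would break the cycle:

    ```c
    /*
     * Hypothetical simulation of the recursion described above.  None of
     * these names are the real PostgreSQL symbols; they only mirror the
     * call shape errfinish() -> EmitErrorReport() -> ProcessInterrupts().
     */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool interrupt_pending = true;   /* e.g. a recovery conflict */
    static bool fatal_in_progress = false;  /* the proposed dedicated flag */
    static int  fatal_count = 0;

    static void handle_fatal(void);

    /* Stand-in for ProcessInterrupts(): may raise a new FATAL. */
    static void process_interrupts(void)
    {
        if (fatal_in_progress)
            return;                 /* guard: no new FATAL during FATAL */
        if (interrupt_pending)
        {
            interrupt_pending = false;
            handle_fatal();         /* without the guard: recursion */
        }
    }

    /*
     * Stand-in for EmitErrorReport(): writing the message to the client
     * ends up processing interrupts, as e.g. internal_flush() can.
     */
    static void emit_error_report(void)
    {
        process_interrupts();
    }

    /* Stand-in for errfinish() handling a FATAL. */
    static void handle_fatal(void)
    {
        fatal_count++;
        if (fatal_count > 10)       /* safety net against runaway recursion */
            return;
        fatal_in_progress = true;   /* key point: set *before* reporting */
        emit_error_report();
    }

    int main(void)
    {
        handle_fatal();
        printf("FATALs processed: %d\n", fatal_count);
        assert(fatal_count == 1);   /* guard held: exactly one FATAL */
        return 0;
    }
    ```

    Setting the flag before emit_error_report() is what distinguishes this from the existing proc_exit_inprogress logic, which (as noted above) only gets set later, during proc_exit() - too late to stop interrupts processed while the report is being sent.
    
    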